Fast Approximate Search in Large Dictionaries
Authors
Abstract
The need to correct garbled strings arises in many areas of natural language processing. If a dictionary is available that covers all possible input tokens, a natural set of candidates for correcting an erroneous input P is the set of all words in the dictionary for which the Levenshtein distance to P does not exceed a given (small) bound k. In this article we describe methods for efficiently selecting such candidate sets. After introducing as a starting point a basic correction method based on the concept of a “universal Levenshtein automaton,” we show how two filtering methods known from the field of approximate text search can be used to improve the basic procedure in a significant way. The first method, which uses standard dictionaries plus dictionaries with reversed words, leads to very short correction times for most classes of input strings. Our evaluation results demonstrate that correction times for fixed-distance bounds depend on the expected number of correction candidates, which decreases for longer input words. Similarly, the choice of an optimal filtering method depends on the length of the input words.
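To make the candidate set concrete, the following is a minimal brute-force sketch, not the universal Levenshtein automaton or the filtering methods the article describes. It scans every dictionary word and keeps those whose Levenshtein distance to the input P is at most k. The function names `levenshtein_within` and `candidates` and the toy word list are illustrative assumptions, not taken from the article.

```python
from typing import Iterable, List


def levenshtein_within(a: str, b: str, k: int) -> bool:
    """Return True if the Levenshtein distance between a and b is at most k."""
    # A length difference larger than k already forces more than k edits.
    if abs(len(a) - len(b)) > k:
        return False
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        # Row minima never decrease, so the bound can no longer be met.
        if min(curr) > k:
            return False
        prev = curr
    return prev[-1] <= k


def candidates(pattern: str, dictionary: Iterable[str], k: int) -> List[str]:
    """Naive baseline: test every dictionary word against the bound k."""
    return [w for w in dictionary if levenshtein_within(pattern, w, k)]


# Hypothetical toy dictionary, for illustration only.
words = ["house", "horse", "mouse", "hose", "nose"]
print(candidates("hose", words, 1))
```

With this toy list and k = 1, the call returns `['house', 'horse', 'hose', 'nose']`; "mouse" needs two edits and is excluded. The cost of this baseline grows linearly with dictionary size, which is the inefficiency that automaton-based and filtering approaches aim to avoid.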
Related resources
Learning Better Encoding for Approximate Nearest Neighbor Search with Dictionary Annealing
We introduce a novel dictionary optimization method for high-dimensional vector quantization employed in approximate nearest neighbor (ANN) search. Vector quantization methods first seek a series of dictionaries, then approximate each vector by a sum of elements selected from these dictionaries. An optimal series of dictionaries should be mutually independent, and each dictionary should generat...
Fast Large-Scale Approximate Graph Construction for NLP
Many natural language processing problems involve constructing large nearest-neighbor graphs. We propose a system called FLAG to construct such graphs approximately from large data sets. To handle the large amount of data, our algorithm maintains approximate counts based on sketching algorithms. To find the approximate nearest neighbors, our algorithm pairs a new distributed online-PMI algorith...
Advances in multirate filter bank structures and multiscale representations
We propose a new framework to extract the activity-related component in the BOLD functional Magnetic Resonance Imaging (fMRI) signal. As opposed to traditional fMRI signal analysis techniques, we do not impose any prior knowledge of the event timing. Instead, our basic assumption is that the activation pattern is a sequence of short and sparsely-distributed stimuli, as is the case in slow event...
A Breadth-First Representation for Tree Matching in Large Scale Forest-Based Translation
Efficient data structures are necessary for searching large translation rule dictionaries in forest-based machine translation. We propose a breadth-first representation of tree structures that allows trees to be stored and accessed efficiently. We describe an algorithm that allows incremental search for trees in a forest and show that its performance is orders of magnitude faster than iterative...
Efficient Similarity Search via Sparse Coding
This work presents a new indexing method using sparse coding for fast approximate Nearest Neighbors (NN) on high dimensional image data. To begin with we sparse code the data using a learned basis dictionary and an index of the dictionary’s support set is next used to generate one compact identifier for each data point. As basis combinations increase exponentially with an increasing support set...
Journal title:
- Computational Linguistics
Volume 30, Issue
Pages -
Publication date: 2004